A massively parallel processor apparatus having an instruction set architecture for each of the N² PEs of
the structure. The apparatus which we prefer will have a PE structure consisting of PEs that contain instruction
and data storage units, receive instructions and data, and execute instructions. The structure should contain
"N" communicating ALU trees, "N" programmable root tree processor units, and an arrangement for
communicating instructions, data, and the root tree processor outputs back to the input processing elements by
means of the communicating ALU trees. The apparatus can be structured as a bit-serial or word parallel system.
The preferred structure contains N² PEs, identified as PE_column,row, in an N root tree processor system, placed in
the form of an N by N processor array that has been folded along the diagonal and made up of diagonal cells and
general cells. The Diagonal-Cells are comprised of a single processing element identified as PE_ii of the folded N
by N processor array and the General-Cells are comprised of two PEs merged together, identified as PE_ij and
PE_ji of the folded N by N processor array. Matrix processing algorithms are discussed followed by a
presentation of the Diagonal-Fold Tree Array Processor architecture. The Massively Parallel Diagonal-Fold Tree
Array Processor supports completely connected root tree processors through the use of the array of PEs that
are interconnected by folded communicating ALU trees.
FIELD OF THE INVENTION

These inventions relate to computers and particularly to massively parallel array processors.

REFERENCES USED IN THE DISCUSSION OF THE INVENTION

1. D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel Distributed Processing Vol.
1: Foundations, Cambridge, Massachusetts: MIT Press, 1986. (Herein referred to as "Rumelhart 86".)
2. E.B. Eichelberger and T.W. Williams, "A Logic Design Structure for Testability," Proceedings 14th
Design Automation Conference, IEEE, 1977. (Herein referred to as "Eichelberger 77".)
3. J.J. Hopfield, "Neurons With Graded Response Have Collective Computational Properties Like Those
of Two-State Neurons," Proceedings of the National Academy of Sciences 81, pp. 3088-3092, May
1984. (Herein referred to as "Hopfield 84".)
4. J.J. Hopfield, "Neural Networks and Physical Systems with Emergent Collective Computational
Abilities," Proceedings of the National Academy of Sciences 79, pp. 2554-2558, 1982. (Herein
referred to as "Hopfield 82".)
5. M.J. Flynn, J.D. Johnson, and S.P. Wakefield, "On Instruction Sets and Their Formats," IEEE
Transactions on Computers Vol. C-34, No. 3, pp. 242-254, March 1985. (Herein referred to as "Flynn ...".)

In the never ending quest for faster computers, engineers are linking hundreds and even thousands of
low cost microprocessors together in parallel to create super supercomputers that divide in order to conquer
complex problems that stump today's machines. Such machines are called massively parallel. [...]
manufacturers including those of Amdahl, Hitachi, Fujitsu and NEC.

[...] computers [...] an interconnection network and program them to operate in parallel. There have
been two modes of operation of these computers. Some of these machines have been MIMD mode
machines. Some of these machines have been SIMD mode machines. Perhaps the most commercially
acclaimed of these machines has been the Connection Machines series 1 and 2 of Thinking Machines [...]

[...] a single processor, a small amount of memory, and communication links to [...] In addition, in
order to build up the system, [...] C011 and C012 would be connected. In addition, switches like an
IMS C004 would be provided to provide, say, a crossbar switch between the 32 link inputs and 32 link
outputs [...] In addition, there will be special circuitry [...] transputers, adapting them to be used for
a special purpose tailored to the requirements of [...] a graphics or disk controller. The Inmos
IMS M212 is a 16 bit processor with on-chip memory and communication links. It contains hardware and
logic to control disk drives and can be used as a programmable disk controller or as a general purpose
interface. In order to use the concurrency [...] Occam, for the transputer. Programmers have to [...]

Some of these MP machines use [...] microprocessors and their associated circuitry. [...]
switches as processor addressable networks. Generally, as with the 14 RISC/6000s which were
interconnected last fall at Lawrence Livermore by wiring the machines together, the processor addressable
networks have been considered as coarse-grained multiprocessors.

Some very large machines are being built by Intel and nCube and others to attack what are called
"grand challenges" in data processing. However, these computers are very expensive. Recent projected
costs are in the order of $30,000,000.00 to $75,000,000.00 (Tera Computer) for computers whose
development has been funded by the U.S. Government to attack the "grand challenges". These "grand
challenges" would include such problems as climate modeling, fluid turbulence, pollution dispersion,
mapping of the human genome and ocean circulation, quantum chromodynamics, semiconductor and
supercomputer modeling, combustion systems, vision and cognition.

Our Massively Parallel Diagonal-Fold Tree Array Processor architecture, which is the subject of this
patent, is applicable for modeling high computational parallel data algorithms, for example matrix processing
and high connectivity neural networks. To demonstrate the general processing capability of our system an
example of matrix multiplication is included.

Problems addressed by our MP Diagonal-Fold Tree Array Processor.

It is a problem for massively parallel array processors to attack adequately the matrix processing
problems which exist.

SUMMARY OF THE INVENTION 

Our newly developed computer system may be described as a Massively Parallel (MP) Diagonal-Fold
Tree Array Processor which operates in a Single Instruction Multiple Data (SIMD) fashion with general
purpose application capability. The MP system we prefer will have an N² Processor Element (PE) structure in
which each PE contains instruction and data storage units, receives instructions and data, and executes
instructions. The PE structure should contain N communicating ALU trees, N Root Tree Processors
(RTP), and a mechanism for communicating both instructions and data back to the PEs by means of the
communicating ALU trees.

The preferred apparatus which will be described contains N² PEs placed in the form of an N by N matrix,
with PEs identified by column-row subscripts, PE_column,row = PE_ij, that has been folded along the diagonal
and made up of Diagonal-PEs and General-PEs.

In our preferred system, the Diagonal-PEs are comprised of single Processing Elements, PE_ii, and the
General-PEs are comprised of two symmetric Processing Elements, PE_ij and PE_ji, that are merged together
and which are associated with the same PE elements of the N by N PE array prior to folding.

Our new organization of PEs and new PE architecture is described in the best way we know to
implement the improvements with an example implementation for matrix multiplication and discussion
concerning neural network emulation, matrix addition, and Boolean operations.

These and other improvements are set forth in the following detailed description. However, specifically 
as to the improvements, advantages and features described herein, reference will be made in the 
description which follows to the below-described drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIGURE 1 illustrates a vector matrix multiplication operation;
FIGURE 2 illustrates general matrix multiplication;
FIGURE 3 shows a 2 value multiplication structure in two parts, FIGURE 3A (Diagonal Cell) and FIGURE
3B (General Cell);
FIGURE 3 illustrates our preferred Processor architecture in two parts, FIGURE 3C (DIAGONAL-PE) and
FIGURE 3D (GENERAL-PE); while
FIGURE 4 shows a preferred communicating ALU tree;
FIGURE 5 illustrates a 4² PE and 4 Root Tree Processor Massively Parallel Diagonal-Fold Tree Array
Processor;
FIGURE 6 illustrates a Processor Element tagged instruction/data format; while
FIGURE 7 illustrates a Processor Element Example Instruction Set.

MATRIX PROCESSING BACKGROUND

A vector matrix multiplication operation utilizing a sum of product calculation especially suited for our
preferred MP organization is shown in FIGURE 1, where there are i columns and j rows.

The input matrix z_i is defined as:

$$z_i = Y_1 W_{i1} + Y_2 W_{i2} + \cdots + Y_N W_{iN}$$

This is a subset of the general case of matrix multiplications. Consider the three N x N matrices Y, W, and
(Y • W) result matrix z, as shown in FIGURE 2, with i columns and j rows and a column,row subscript
notation. It is assumed that a MP Diagonal-Fold Tree Array Processor is available with the following
assumed capabilities:

• N Root Tree Processors, each possessing a Y value memory capacity of N Y values and an additional
memory capacity for N result values.

• N² PEs with the W values to be stored in internal PE registers.

• The Root Tree Processor complex issues all instructions in broadcast mode.

The procedure described represents only one of many possibilities and is not
necessarily the best procedure depending upon an application. It is meant to demonstrate the capability
of our MP Diagonal-Fold Tree Array Processor. The basic procedure is to have N Root Tree Processors
send a row of the Y matrix and a multiplication instruction, with the Auto mode specified, to the PEs which
execute the multiplication and send the results to the CATs for summation, providing a row of the result
matrix back to the Root Tree Processors for storage. The Root Tree Processors then read out a new row of
the Y matrix and send it to the PEs, continuing generating a row of the result matrix at the output of
the CATs and storing the row in memory until all result rows have been calculated. The W value matrix,
once initialized in the PEs, remains fixed, internal to the PEs, throughout the operation.
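
The basic procedure can be sketched in software as follows. This is a minimal illustrative model, not the hardware itself: the function names `tree_sum` and `stream_matmul` are hypothetical, and the sketch assumes that the PE at column i, row j holds the W entry with the same column,row subscripts, so that tree i sums the products Y_j W_ij.

```python
def tree_sum(values):
    """Model one communicating ALU tree in summation mode using 2-to-1 adders."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:            # pad an odd level so every adder has two inputs
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def stream_matmul(Y, W):
    """Return Y x W, producing one result row per broadcast row of Y."""
    n = len(W)
    # Assumption: PE(i, j) holds the W entry at column i, row j,
    # which is W[j][i] when W is given as a row-major list of rows.
    pe_w = [[W[j][i] for j in range(n)] for i in range(n)]
    result = []
    for y_row in Y:                   # Root Tree Processors send one row of Y
        # Every PE multiplies its fixed W value by the received Y value;
        # tree i then sums the N products of its attached PEs.
        result.append(
            [tree_sum(pe_w[i][j] * y_row[j] for j in range(n)) for i in range(n)]
        )
    return result
```

Note that the W values are loaded once into `pe_w` and never change, while the Y rows stream through, mirroring the description above.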

PROCESSOR ELEMENT ARCHITECTURE 

Internally, a triangular scalable neural array processor (T-SNAP) structure utilized two types of "cell"
structures, the Diagonal-Cells and the General-Cells, for the direct emulation of the neural sum of product
function and did not address the processing of locally stored data, for example as required by learning
algorithms, see Rumelhart 86. The basic multiplier element structures, without [...]
PE and the General-PE. The modifications to the basic processing structure, FIGURE 3, [...]
execution of an instruction, whose result [...]
which is conditional based upon the state of the destination register's CEB. The CEB indicates [...]
decoded PE's instruction register received from either an optional instruction buffer or, with
no instruction buffer, from the attached communicating ALU tree that is in a communications mode. Each PE
upon receipt of an instruction in the instruction register will execute the operation specified by that
instruction. The instruction types include a data/instruction path specification, data movement, arithmetic,
and logical instructions. Each PE contains an instruction register for each processing element specifying the
source and destination paths and EXU functions; a Diagonal-PE contains one instruction register and the
General-PE contains two instruction registers.

The modification to the T-SNAP cells must preserve the functional capabilities provided by the original
cells, in order to support neural emulation as well as other applications requiring similar capabilities. An
essential, novel, and general purpose functional capability provided by the T-SNAP multiplier cells, that
must be maintained in the new processor cell structure, concerns the emulation of completely connected
processors, for example neuron processors as used for completely connected networks such as Hopfield 82
and Hopfield 84. This important function is briefly reviewed using the original T-SNAP cells, FIGURE 3A
and 3B. For example, with a neural network model in an execution mode, implying a
multiplication operation in each processing cell, the diagonal cell multiplies its stored weight with its stored
Y value and supplies the multiplied result to the attached add tree. In the communications mode for the
diagonal cells, a Y value is received from the attached add tree and stored into the Y value register. The
"General-Cells" of the structure also generate a weight times Y value and supply the product to their
attached add trees. In the communications mode for these "General-Cells", a Yj value received from the
bottom multiplier add tree is stored into the top Y value register and likewise a Yi value received from the
top multiplier add tree will be stored into the bottom Y value register. This switch in storing the Y values is
an essential characteristic supporting complete connectivity. For the modified processing cells, FIGURE 3C
and 3D, this path switch is programmable, allowing further unique architectural features for
processing, as will be described in the Processor Element Instruction Set section of this Chapter. To
preserve the internal path switch function of the original T-SNAP cells, the new processor cells require that
the data path registers be specified (loaded) in advance of receiving data from a tree. The data path register
specifies the destination of the Yj data received from the bottom tree to be the top Yj register and the
destination of the Yi data received from the top add tree to be the bottom Yi register, thereby preserving the
complete connectivity function.
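
The path switch can be illustrated with a small software model. The sketch below is an assumption-laden illustration rather than the cell design itself: the class `GeneralCell` and the function `folded_sums` are hypothetical names, and each General-Cell is assumed to hold W_ij with Y_j in its top element (feeding tree i) and W_ji with Y_i in its bottom element (feeding tree j).

```python
class GeneralCell:
    """Two merged PEs of the folded array: top holds (W_ij, Y_j), bottom holds (W_ji, Y_i)."""

    def __init__(self, w_top, w_bottom):
        self.w_top, self.w_bottom = w_top, w_bottom
        self.y_top = self.y_bottom = 0

    def receive(self, from_top_tree, from_bottom_tree):
        # Path switch: the value arriving on the top tree (Y_i) goes to the
        # bottom register, and the value arriving on the bottom tree (Y_j)
        # goes to the top register.
        self.y_bottom = from_top_tree
        self.y_top = from_bottom_tree

    def products(self):
        # (product sent to tree i, product sent to tree j)
        return self.w_top * self.y_top, self.w_bottom * self.y_bottom


def folded_sums(W, y):
    """Sum of W[i][j] * y[j] on each tree i, computed through the folded cells."""
    n = len(W)
    sums = [W[i][i] * y[i] for i in range(n)]      # Diagonal-Cells feed their own tree
    for i in range(n):
        for j in range(i + 1, n):
            cell = GeneralCell(W[i][j], W[j][i])
            cell.receive(from_top_tree=y[i], from_bottom_tree=y[j])
            top, bottom = cell.products()
            sums[i] += top
            sums[j] += bottom
    return sums
```

With the crossover in `receive`, every tree i ends up summing W_ij Y_j over all j, which is the complete connectivity property described above; storing the values without the switch would pair each weight with the wrong Y value.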

The symbolic summation tree is shown on the left of FIGURE 4 with ALUs at each
stage designated by the letter A. The more detailed representation of the communicating ALU tree structure
that will be used is shown on the right-hand side of FIGURE 4. Pipeline latches have been
left out for more clarity. For specific applications, the ALU function might be as simple as a bit-serial adder
or provide more complex programmable functions requiring an instruction set architecture. For the purposes
of describing the function execution and communications operations a summation operation may be
referred to in this text. The use of the summation function is for simplicity of explanation and not intended
to imply a limit to the functionality the communicating ALU tree can provide. In addition, the tree node's
control mechanism, that determines the node's operational mode and function, can use separate control
lines or tagged tree node instructions. For a single node function such as addition and two operational
modes, namely communications and function execution, a single control line implementation is feasible. If
more extended functions are to be supported in a tree node, then not only would additional control
mechanisms be required but storage elements may be required in a tree node. In addition, if multiple
functions are provided in the tree nodes then a method of synchronously controlling tree operations must
be utilized. If varying function execution timings are to be allowed in each tree node then an asynchronous
interfacing method must be provided between the tree stages. For simplicity of implementation that
guarantees the synchronization control, a restriction could be enforced that the same operation be specified
for each tree stage. In FIGURE 4 three ALU elements are shown in a 2 stage pipelined tree
arrangement. The ALU element has a SWitch 1, SW1, block on its output and two SWitch 2s, SW2, blocks
bypassing the ALU. The communicating ALU tree can be placed into one of two modes, namely a function
execution mode and a communications mode, also termed a bypass mode. A common control signal is
used at each ALU element in order to guarantee that all nodes of the tree provide the same mode of
operation. One of the functions specified by the tree control signal, an accompanying tag signal or common
distributed signal, is the ALU bypass. Both switches, SW1 and SW2, have an on/off control which, when in
the "off" state, keeps the switch open, i.e. in a high impedance state, and when in the "on" state bypasses
the ALU (node function) via a low impedance path. When SW1 is enabled SW2 is disabled and vice versa.
In this manner the ALU tree can provide the summation function, for example, in one direction, SW1's on -
SW2's off, while essentially acting as a communication path in ALU bypass mode, SW1's off - SW2's on.
The ALU tree using 2 to 1 functional elements, such as 2-1 adders, will require log2 N stages. Alternatively,
the ALU function and communication mode can be implemented with 3-1, 4-1, ..., N-1 functional elements,
such as 3-1, 4-1, ..., N-1 adders, and their bypass switches, utilizing all the same element types or in
combination, to produce the specified function. It should be noted that the Communicating ALU, FIGURE 4,
represents its logical function since, for example, depending upon technology, the
SW1 function could be incorporated in the gate devices used in the last internal stage of each ALU
element, thereby adding no additional delay to the ALU function. Alternatively, a separate communications
tree path could be provided, thereby allowing communications to occur while an ALU function is in
progress.
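
The two tree modes can be modeled in a few lines. This is a behavioral sketch only, assuming 2-to-1 adder nodes and a single common mode for all nodes; the class name `CommunicatingTree` and its method names are hypothetical.

```python
class CommunicatingTree:
    """Behavioral model of a communicating ALU tree built from 2-to-1 adders."""

    def __init__(self, n_leaves):
        self.n = n_leaves
        # Number of 2-to-1 stages: ceil(log2(N)), computed without floating point.
        self.stages = (n_leaves - 1).bit_length()

    def execute(self, leaf_values):
        """Function execution mode (SW1 on, SW2 off): leaves are summed toward the root."""
        if len(leaf_values) != self.n:
            raise ValueError("one value per attached PE is required")
        level = list(leaf_values)
        for _ in range(self.stages):
            if len(level) % 2:        # pad so that every adder has two inputs
                level.append(0)
            level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        return level[0]

    def communicate(self, root_value):
        """Communications (bypass) mode (SW1 off, SW2 on): the root value reaches every leaf."""
        return [root_value] * self.n
```

For N = 4 attached PEs the model uses two stages, matching the log2 N stage count given above for 2-to-1 elements.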

A 4 Root Tree Processor example is shown in FIGURE 5, which connects the sixteen
PEs with four CATs and four Root Tree Processors with a Host interface to provide a complete picture of
the machine organization used in the Massively Parallel Diagonal-Fold Tree Array Processor. The CATs are
assumed to provide a summation function in the function execution mode. An example of the elements
involved in the calculation for the third Root Tree Processor (RTP) is:

$$RTP_3 = F(W_{31}Y_1 + W_{32}Y_2 + W_{33}Y_3 + W_{34}Y_4)$$

The Host interface represents a central control point for the array of PEs, allowing the Host to have access to
the Root Tree Processors' internal storage possibly containing, for example, the initial parameters, [...]
calculated values, and traced values. There is assumed to be a Root Tree Processor for each
communicating/function execution tree and their N attached PEs. Each Root Tree Processor issues
instructions and data to the N tree attached PEs through the communications mode of the tree.
Additional functions of the Root Tree Processor and Host interface include the following:
1. All processor initializations
2. Starting the system
3. Stopping the system
4. Communicating ALU tree control
5. PE instruction and data issuing

In operation, the N² PE structure might require an initialization of certain registers. Even though [...]
initialization purposes. Other initialization mechanisms are clearly possible [...]

PROCESSOR ELEMENT INSTRUCTION SET 

An example instruction set providing the previously discussed capability will be reviewed in this section.
[...] tagged instruction/data for B = 0. A broadcast message/data goes to all N [...]
Processor Elements belonging to a Root Tree Processor [...] is to be accomplished. Alternatively, groups
of Processor Elements can utilize the same tag value, thereby uniquely identifying groups of PEs. The
received tag is bit by bit compared with a stored tag in each PE. After the last tag bit compare is
completed, it is known whether the following INSTR/DATA is to be received by that particular PE. A tag
match results in an INSTR or Data being received, while a no match situation prevents the reception of an
INSTR or Data. A parity bit or error correction bits denoted by a P can also be included in the tag field, as
shown in FIGURE 6, for error handling reasons.
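
The bit by bit tag compare can be sketched as follows. The function names `tag_matches` and `deliver`, the tag width, and the PE names are illustrative assumptions; parity and error correction bits are left out of this sketch.

```python
def tag_matches(stored_tag_bits, received_tag_bits):
    """Compare a received tag with the stored tag one bit at a time."""
    if len(stored_tag_bits) != len(received_tag_bits):
        raise ValueError("tags must have the same length")
    match = True
    for stored, received in zip(stored_tag_bits, received_tag_bits):
        # One compare per arriving bit; the outcome is known after the last bit.
        match = match and (stored == received)
    return match


def deliver(pes, tag_bits, payload):
    """Return the payload received by each PE whose stored tag matches.

    `pes` maps a PE name to its stored tag; PEs sharing a tag form a group.
    """
    return {name: payload for name, stored in pes.items() if tag_matches(stored, tag_bits)}
```

Giving two PEs the same stored tag makes both receive the instruction or data that follows a matching tag, which corresponds to the group selection described above.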

The communicated instructions or data also contain a single bit (INSTR) indicating whether the bit string
is data or instruction, and for instructions additional fields for specifying an automatic execution mode
(AUTO), instruction opcode (INSTR), operand selection (SOURCE1 and SOURCE2), and result destination
(DESTINATION). Error correction/detection bit/s (ECC) can be included on both instructions and data for
error handling reasons. It is assumed that the instruction and data bit lengths are the same. FIGURE 7
lists the present instruction set functions.

The instruction set may contain arithmetic operations, for example, add, subtract, multiply, divide,
square root, etc., logical operations, for example, AND, OR, EX-OR, Invert, etc., compare, shift, and data
storage movement operations. The instruction set is primarily determined from an application specific
perspective.

A fairly standard instruction format is used with the unique addition of the AUTO bit as representing an
automatic execution mode. The auto execution mode represents a capability that switches the execution
mode of the PEs from an instruction execution only mode to a data dependent mode of execution. The
control of the switch from a control flow execution mode to a data flow execution mode is programmable by
use of the AUTO bit to engage the data flow mode and a rule that allows the return to control flow
instruction execution mode. An instruction with the AUTO bit active is executed first due to normal
instruction control flow execution sequencing and then it is executed each time valid data is received in the
processing unit. The data flow execution continues until a new instruction is received which stops the
previous "AUTO" instruction from executing and begins the execution of the newly received instruction,
which may also be another AUTO instruction.
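
The AUTO rule can be expressed as a small state machine. This is a minimal sketch under stated assumptions: the class `ProcessorElement`, its method names, and the use of a single W and Y value per PE are hypothetical simplifications, and only the MPY and NOP mnemonics are modeled.

```python
class ProcessorElement:
    """Toy PE with one weight, one Y register, and the AUTO execution rule."""

    def __init__(self, w):
        self.w = w
        self.y = 0
        self.instr = None
        self.auto = False
        self.outputs = []          # products that would be sent to the tree

    def receive_instruction(self, opcode, auto=False):
        # A new instruction always replaces the previous one, ending any AUTO mode.
        self.instr, self.auto = opcode, auto
        self._execute()            # control flow: executed once on receipt

    def receive_data(self, y):
        self.y = y
        if self.auto:
            self._execute()        # data flow: re-executed on each valid data arrival

    def _execute(self):
        if self.instr == "MPY":
            self.outputs.append(self.w * self.y)
        # "NOP" and unknown opcodes produce no output in this sketch.
```

A multiply issued with `auto=True` therefore runs once on receipt and again for each later data arrival, until another instruction such as NOP is received.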

To demonstrate the importance of the AUTO mode for processing, a simple example using the Hopfield
neural network will be presented. For this discussion, instruction mnemonics, as presented in the example
instruction set architecture of FIGURE 7, are used. Assume the Hopfield neural network
model, see Hopfield 84, is used as an example for the direct emulation of the network neurons' sum of
connection weight times connecting neuron output values. Each network update cycle consists of weight
times Y value multiplication operations, summation of multiplication results, the generation of the nonlinear
sigmoid neuron output Y values, and the communication of the generated Y values to the processing
elements. The network updates continue until a network minimum is reached. For simplicity of discussion,
assume that network convergence is not tested for on every cycle, but only after some multiple cycles have
been executed. For the network emulation using the Processor Elements, an automatic mode can be
specified where, instead of requiring the repeated sending of a Multiply instruction to the PEs after each
network execution cycle in order to initiate the next network cycle, the automatic mode would begin the next
update cycle automatically after receipt of the newly calculated Y values. This automatic mode is initiated
by setting the AUTO bit to a "1" in the instruction desired, such as Multiply (MPY) for use in the Hopfield
network example, which sets an automatic mode flag in the PEs. The first operation is initiated with the
receipt of the instruction with the AUTO bit set to a "1" and the instruction would be repeatedly executed
upon receipt of the new updated data, continuing until a new instruction is received which terminates the
automatic mode, such as receipt of a NOP instruction. A capital A is appended to an instruction mnemonic
to indicate that the auto bit is to be set to a "1", for example MPYA.

The source and destination addresses specified in FIGURE 7 are relative to the instruction
register where the instruction is received. The relative addressing is shown in FIGURE 3D,
where the top instruction register INSTR TREG relative addresses are shown in columnar fashion, located to
the right of the register blocks, while the relative addressing for the bottom instruction register INSTR BREG
is shown in columnar fashion, located to the left of the register blocks. For k temporary or working
registers, it should be noted for example, that the bottom instruction register R2 is the same as the top
instruction register R(2 + k + 1). A bit string received from the ALU tree, if it is an instruction, is serialized
into one of the two INSTR registers in each General-Cell, as directed by the INSTR PATH BIT, and the
single INSTR register of a Diagonal-Cell. A data bit string received from the ALU tree is serialized to one of
the k + 4 other registers available in a General-Cell and one of the k/2 + 2 other registers available in a
Diagonal-Cell as specified by the DATA PATH register. It is assumed that for a symmetrical structure the
Diagonal-Cells contain half the number of instruction and data registers as compared to the General-Cells.
In the Diagonal-PEs a source or destination address of R(2 + k/2 + 1) through R(2 + k + 2) and CR2 are
mapped as follows:

• R(2 + k/2 + 1) → R(2 + k/2)

• R(2 + k/2 + 2) → R(2 + k/2 - 1)

• continuing

• R(2 + k/2 + k/2 + 2) = R(2 + k + 2) → R(2 + k/2 - k/2 - 1) = R1

• CR2 → CR1

For example, assume k = 2 working registers in the General-Cells and a three bit source or destination
address; then, having the General-Cells use all three bits and the Diagonal-Cells use only the 2 lsb bits, the
proper mapping can be provided by:

• 000 → CR1

• 001 → R1

• 010 → R2

• 011 → R3

• 100 → CR2

• 101 → R6

• 110 → R5

• 111 → R4
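
The register mapping and the k = 2 address table can be checked with a short sketch. The function name `diagonal_register` and the dictionary `GENERAL_DECODE` are hypothetical; the sketch assumes k is even and that register numbers are the R indices used above.

```python
def diagonal_register(index, k):
    """Map General-Cell register R(index) to the register used in a Diagonal-Cell."""
    half = 2 + k // 2              # a Diagonal-Cell has registers R1 .. R(2 + k/2)
    total = 2 + k + 2              # a General-Cell has registers R1 .. R(2 + k + 2)
    if not 1 <= index <= total:
        raise ValueError("register index out of range")
    # Upper-half registers mirror onto the lower half: R(2 + k/2 + m) -> R(2 + k/2 - m + 1).
    return index if index <= half else total + 1 - index


# Three bit address decode for k = 2, as listed in the table above.
GENERAL_DECODE = {
    0b000: "CR1", 0b001: "R1", 0b010: "R2", 0b011: "R3",
    0b100: "CR2", 0b101: "R6", 0b110: "R5", 0b111: "R4",
}


def diagonal_decode(address):
    """A Diagonal-Cell looks only at the two least significant address bits."""
    return GENERAL_DECODE[address & 0b011]
```

With this encoding, dropping the most significant bit in a Diagonal-Cell yields exactly the mirrored register, for example address 101 (R6 in a General-Cell) decodes to R1.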

The PATH instruction is treated differently from the other instructions, since it controls the instruction path
selection mechanism. The PATH instruction is decoded prior to the distributor logic [...]
register [...] INSTR PATH BIT [...]
for register path selection; other formats are clearly possible. The PATH instruction must be reissued if a
different path is desired. A default path is specified by the architecture for initialization purposes, for
example the DATA PATH registers could be initialized to R5, the Y value register [...]

[...] the conditional execution bit in each data register,
see FIGURE 6 showing the instruction and data formats. If a CEB is set to a "zero" in an
instruction's destination register, that instruction will be treated as a NOP instruction, i.e. the destination
register's contents will not be changed and "zeros" will be fed to the Add tree. If the CEB is set to a "one"
the register's contents can be modified. For example, this bit is used on the [...]
presence or absence of a value since a zero value is not always sufficient to accomplish [...]
with the Root Tree Processor is [...]
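
The effect of the conditional execution bit can be sketched as follows. This is an illustrative model only: the class `Register` and the function `execute_to_register` are hypothetical names, and returning the result as the value presented to the tree when the CEB is "one" is an assumption made for the sketch.

```python
class Register:
    """Data register with a Conditional Execution Bit (CEB)."""

    def __init__(self, value=0, ceb=1):
        self.value = value
        self.ceb = ceb


def execute_to_register(dest, result):
    """Apply an instruction result to `dest`, honoring the CEB.

    Returns the value presented to the Add tree in this sketch.
    """
    if dest.ceb == 0:
        return 0               # treated as a NOP: register unchanged, zero fed to the tree
    dest.value = result        # CEB is "one": the register contents can be modified
    return result              # assumption: the result itself is what the tree sees
```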

Many instructions specify a destination which is local to the individual Processor Element. This local
processing can cause synchronization problems if not handled correctly. Instead of proliferating
synchronization mechanisms throughout the structure, the local processing synchronization problem can be
localized to the Root Tree Processors. For example, if no notification of local processing completion is
generated from the PEs, a fixed hardware mechanism can be provided at the Root Tree Processor to
guarantee safeness of the operations. It is also not desirable to "solve" the problem via means of queues in
the Processor Elements as this increases the size of the PE, limiting the number which could be placed on a
single chip. Rather, the instruction issuing point should be used to resolve and avoid all hazards. Any local
processing instruction to the same PE must be separated from the next instruction to that same PE by the
specified processor instruction's execution time. For example, if the multiply executed in 2L clocks, a 2L
time out must be ensured prior to sending the next instruction. This is necessary so that an instruction
buffer register is not required, thereby allowing each instruction to remain constant in a PE during the
operation of the instructed function. Each Root Tree Processor can then be set up with a synchronization
mechanism to safely issue instructions to each PE at a maximum rate. Non-local instructions, i.e. those
instructions where the destination is the ADD TREE, provide notification of operation completion when the
converged tree result reaches the Root Tree Processors. For non-local instructions the Root Tree
Processors wait until a result is received before sending a new instruction to the PEs attached to that tree.
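
The time out rule at the instruction issuing point can be sketched as a simple scheduler. The class `IssuePoint`, the latency table, and the clock values are hypothetical; only local instructions with a fixed, known execution time are modeled.

```python
class IssuePoint:
    """Root Tree Processor side bookkeeping that spaces out local instructions per PE."""

    def __init__(self, latency):
        self.latency = latency     # opcode -> execution time in clocks (assumed known)
        self.ready_at = {}         # PE name -> earliest clock for its next instruction

    def issue(self, pe, opcode, clock):
        """Return the clock at which the instruction may actually be sent to `pe`."""
        start = max(clock, self.ready_at.get(pe, 0))
        # The PE holds this instruction for its whole execution time, so the
        # next instruction to the same PE must not arrive before it finishes.
        self.ready_at[pe] = start + self.latency[opcode]
        return start
```

For example, with a multiply taking 2L clocks, a second instruction requested for the same PE before those 2L clocks have elapsed is held back, while instructions to other PEs are unaffected.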

As a final note, a compiler would be required to ensure no destination conflicts occur in programs using
the described instruction set.

MATRIX PROCESSING EXAMPLE 

The following detailed procedure will be followed, see PE example instruction sets: (Note that the PE
instructions are indicated by PE-Instruction Mnemonic and don't care states are indicated by (x).)

1. CATs placed into communication mode

2. Initialize W matrix into PE registers

3. Each Root Tree Processor memory is initialized as follows:

• Root Tree Processor 1 initialized with Y11, Y12, ..., Y1N.

• Root Tree Processor 2 initialized with Y21, Y22, ..., Y2N.

• ...

• Root Tree Processor N initialized with YN1, YN2, ..., YNN.

4. Initialize the Root Tree Processor and PE PATH registers:

• Set PE INSTR PATH Bit to CR2 indicating YOUT mode

• Set PE DATA PATH to R2

5. All Root Tree Processors are active

6. Root Tree Processors send the first row of Y values to the PEs.

7. Root Tree Processors send PE-MPYA R1*R2 → ADD TREE. After sending the instruction to the PEs,
the Root Tree Processors place the CATs into the summation mode.

8. Root Tree Processors receive the summation results from the CAT roots.

9. Root Tree Processors send the second row of Y values to the PEs.

10. While the PEs and CATs are calculating the next row of the result matrix, the Root Tree Processors
can store the first row of the result matrix, in this example to the additional storage capacity assumed
present.

11. Since the Auto mode was specified in the PE-MPYA instruction, the PEs upon receipt of the second
row Y values will automatically execute a PE-MPY R1*R2 → ADD TREE with CAT results to be received
in the Root Tree Processors.

12. The Root Tree Processors send the third row of Y values and store the second row of the result
matrix. The process continues until

13. Root Tree Processors store the last row of the result matrix.

At completion of operation, the original Y and W matrices are intact and the result matrix is located in
the Root Tree Processors' additional storage area which can then be further operated upon by the Root
Tree Processors or Host system.

Matrix addition and Boolean operations can also be supported by the structure. Assuming matrices of 
the same form as given in the referenced figure (Fig 'MATRX2'), both Y and W matrices can be loaded into the PE 
array since there are N² unique Y and W registers in the structure. Local addition or Boolean operations on 
the Y and W registers can be done within the structure with the result sent to the temporary registers. At 
completion of the operation, the original Y and W matrices will remain intact in the structure and the temporary 
registers will contain the result matrix. The result can be scanned out or individually read out from the Processor 
Element cells or used for further operations (chaining or linking of instructions). 
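
A minimal sketch of the local operation follows, assuming each PE holds one Y value and one W value and writes its result to a temporary register. The helper name local_op and the nested-list layout are illustrative assumptions, not part of the described structure.

```python
import operator


def local_op(Y, W, op):
    """Apply op element by element; the returned matrix models the temporary registers."""
    return [[op(y, w) for y, w in zip(y_row, w_row)] for y_row, w_row in zip(Y, W)]


Y = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
temp_add = local_op(Y, W, operator.add)   # local addition
temp_and = local_op(Y, W, operator.and_)  # local Boolean AND
print(temp_add, temp_and)
```

No tree communication is needed here, since every result depends only on the two values already held in the same PE.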

Claims 

1. A computer system apparatus for general applications, including matrix processing, is 
comprised of root tree processors, communicating ALU trees, processing elements (PEs), means for 
communicating both instructions and data between the root tree processors and the processing 
elements, and wherein each processor contains instruction and data storage units, receives instructions 
and data, and executes instructions. 

2. The apparatus according to claim 1 further comprising N² processing elements, placed in the form of an 
N by N processor array, identified with a two subscript notation PE column,row, that has been folded along the 
diagonal and made up of diagonal cells and general cells. 

3. The apparatus according to claim 2 wherein said diagonal cells, identified as PE ii, are each comprised 
of a single processing element, and the general cells are each comprised of two processing elements, identified as PE ij 
and PE ji, that are merged together. 
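
The diagonal fold can be sketched as follows, assuming zero-based indices and a dictionary keyed by the unordered index pair. The function name fold_array and this representation are hypothetical and serve only to illustrate how the cells are formed.

```python
def fold_array(n):
    """Fold an n by n PE array along its diagonal into diagonal and general cells."""
    diagonal = []  # one PE per diagonal cell: PE(i, i)
    general = {}   # two merged PEs per general cell: PE(i, j) and PE(j, i)
    for i in range(n):
        for j in range(n):
            if i == j:
                diagonal.append((i, i))
            else:
                general.setdefault((min(i, j), max(i, j)), []).append((i, j))
    return diagonal, general


diagonal_cells, general_cells = fold_array(4)
print(len(diagonal_cells), len(general_cells))
```

For an N by N array this gives N diagonal cells and N(N-1)/2 general cells, which together hold all N² PEs.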

4. The apparatus according to claim 3 wherein the diagonal cells' single PEs are each comprised of a tag 
matching unit, a destination control mechanism for externally received instructions and data by 
means of an instruction/data decoding mechanism, a data path storage unit and a distributor unit, 
instruction storage units comprised of an instruction buffer [...] and one 
instruction storage unit used for instruction decode and operational control, multiple data storage units, 
an operand selection mechanism controlled by means of an instruction decoding mechanism [...] 
addressing means relative to the decoded instruction storage unit, a result 
destination path control mechanism controlled by means of an instruction decoding mechanism and a 
distributor unit, and a programmable execution unit. 

5. The apparatus according to claim 3 wherein the diagonal cells' PEs supply results to and receive 
instructions and data from an attached communicating ALU tree. 

6. The apparatus according to claim 3 wherein the general cells' two PEs [...] 
received instructions and data by means of two instruction/data decoding mechanisms, two data path 
[...], two [...] operand selection mechanisms controlled by means of two 
instruction decoding mechanisms, a common selector unit and two addressing means [...], 
two decoded instruction storage units, two result destination path control mechanisms controlled by [...] 

8. The apparatus according to claim 1 wherein the PEs' data storage units contain conditional execution 
bits with one bit per data storage unit, said bit controlling the use of the data and whether the data 
may be overwritten. 

9. The apparatus according to claim 1 wherein the root tree processor provides function execution upon 
data supplied from the communicating ALU trees in a function execution mode and supplies 
instructions/data to the communicating ALU trees in a communications mode. 

10. The apparatus according to claim 1 wherein binary communicating ALU trees contain log₂ N 2 to 1 
communicating ALU stages. 

11. The apparatus according to claim 10 wherein each stage in the communicating ALU trees contains 2 to 
1 communicating ALUs comprised of a 2 to 1 ALU, an ALU bypass path for the purposes of 
communicating values in an opposite direction than that used for the ALU execution, and means for 
switching between the ALU function and the communication's path. 
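
The two modes of a binary communicating ALU tree can be modeled with a short sketch. It assumes N is a power of two and uses addition as the default ALU function; the function names tree_function_mode and tree_communications_mode are illustrative and do not appear in the claims.

```python
import math


def tree_function_mode(leaf_values, alu=lambda a, b: a + b):
    """Reduce N leaf values to one root value and count the 2 to 1 stages used."""
    level, stages = list(leaf_values), 0
    while len(level) > 1:
        # One stage: every 2 to 1 ALU combines a pair of inputs.
        level = [alu(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages


def tree_communications_mode(root_value, n_leaves):
    """Model the bypass path: the root value travels back to every leaf unchanged."""
    return [root_value] * n_leaves


root, stages = tree_function_mode([1, 2, 3, 4, 5, 6, 7, 8])
leaves = tree_communications_mode(root, 8)
print(root, stages, leaves)
```

With eight leaves the reduction takes three stages, matching the log₂ N stage count of claim 10.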

12. The apparatus according to claim 1 wherein the communicating ALU trees each connect to an 
additional ALU stage wherein an external input value is processed with the output of the communicating 
ALU tree and said additional ALU stage provides results to the root tree processors. 

13. The apparatus according to claim 1 wherein the root tree processors and their Host computer interface 
provide the following functions: 

• communicating ALU tree control 

• PE initializations 

• PE instruction issuing 

• algorithmic data calculations 

• PE data issuing 

• synchronously starting the PEs into execution mode 

• synchronously stopping the PEs 

14. The root tree processor controlling apparatus of claim 13 contains multiple storage arrays 
corresponding to the PEs' storage units, supporting initialization procedures, result storage, and tracing 
operations. 

15. The apparatus according to claim 1 wherein there are N² PEs, N communicating ALU trees, and N root 
tree processors for an N array structure. 

16. The apparatus according to claim 15 wherein each communicating ALU tree connects to N PEs at the 
leaf nodes of the tree and one root tree processor which connects to the root of the tree providing 
results to a Host interface and where said communicating ALU trees, PEs, and root tree processors 
constituting the N array structure have: 

• means for inputting data values to each PE, 

• means for communicating tagged instructions and data to the PEs from the root tree processing 
units, 

• means for controlling the destination of instructions and data in each PE, 

• means for the execution of the received instructions in each PE, 

• means for the execution of a previously received instruction when, in an auto mode, data is 
received to be used in the next operation, 

• means for operand selection and destination path control allowing results to stay locally in each 
PE or to be sent to the attached communicating ALU tree, 

• means for the converged function execution of values received from the multiple PEs, 

• means for the inputting of external data values to each root tree processor, 

• means for the generation of new instructions and data. 

17. The apparatus according to claim 16 wherein the means for inputting data values to each PE comprises 
a host interface controlling mechanism in the form of a root tree processor and its programmable 
processor controlling apparatus which has access to each data value storage unit in each PE. 

18. The apparatus according to claim 16 wherein the means for communicating tagged instructions and 
data to the PEs from the root tree processor is by means of the communicating ALU trees acting in a 
communications mode and tag matching units in each PE wherein the tag comprises a broadcast bit 
and a tag address field. 

19. The apparatus according to claim 16 wherein the means for controlling the destination of instructions 
and data in each PE is, for instructions, by an instruction decoding mechanism, an instruction path bit 
and distributor logic in the general cells and by an instruction decoding mechanism, register mapping 
logic whereby general cell specified registers are mapped to diagonal cell registers, and distributor 
logic in the diagonal cells and, for data, by a data decoding mechanism and a data path storage unit in 
both the diagonal cells and the general cells. 

20. The apparatus according to claim 16 wherein in one mode of operation of the general cells, termed the 
YIN mode, the data path storage units and the instruction path bits are set up such that instructions 
received from the top communicating ALU tree are directed to the top PE's instruction storage unit and 
instructions received from the bottom communicating ALU tree are directed to the bottom PE's 
instruction storage unit, and data received from the top communicating ALU tree are directed to the top 
PE's specified data storage unit and data received from the bottom communicating ALU tree are 
directed to the bottom PE's specified data storage unit. 

21. The apparatus according to claim 19 wherein in a second mode of operation of the general cells, 
termed the YOUT mode, the data path storage units and the instruction path bits are set up such that 
instructions received from the top communicating ALU tree are directed to the bottom PE's instruction 
storage unit and instructions received from the bottom communicating ALU tree are directed to the top 
PE's instruction storage unit, and data received from the top communicating ALU tree are directed [...] 

22. The apparatus according to claim 16 wherein the means for the execution of the received instructions 
in each PE is through a programmable execution unit responding [...] 

[...] auto mode flag as set by a received instruction having the capability of setting the auto mode flag [...] 
of the communicating ALU tree or alternate signaling means for communicating [...] 

24. The apparatus according to claim 16 wherein the means for operand selection and destination path 
control allowing results to stay locally in each PE or to be sent to the attached communicating ALU tree 
is by means of an instruction decoding mechanism [...] 

27. The apparatus according to claim 16 wherein the means for the generation of new instructions and data 
[...] root tree processors and a programmable controlling apparatus which interfaces to 
an attached host computer and to the N communicating ALU trees. 

28. The programmable processor controlling apparatus of claim 27 wherein a time out state machine 
controlling mechanism is used on the instruction and data issuing mechanism to avoid hazards on the 
structure. 

29. The apparatus of claim 16 wherein the data are in a bit-serial format which for the data is, in the order 
the bits are received into a diagonal or general cell, first a broadcast bit, next a tag field, next an error 
handling bit/s, continuing with an instruction bit set to an inactive state to indicate data, a spare bit, a 
data field, and ending in error handling bits. 

30. The apparatus of claim 16 wherein the instructions are in a bit-serial format which for instructions is, in 
the order the bits are received into a diagonal or general cell, first a broadcast bit, next a tag field, next 
an error handling bit/s, continuing with an instruction bit set to an active state to indicate an instruction, 
an auto bit, an instruction field indicating the instruction type, a source-1 field indicating the first operand, 
a source-2 field indicating the second operand, a destination field indicating the destination of results, 
an immediate data field, and ending in error handling bit/s. 
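
The field ordering of the bit-serial instruction format can be illustrated with a small encoder and decoder. The claims give only the order of the fields; every field width below, the field names, and the functions encode and decode are hypothetical choices made for this sketch.

```python
# Hypothetical field widths in bits; the source specifies the order only.
INSTRUCTION_FIELDS = [
    ("broadcast", 1), ("tag", 4), ("error_head", 1), ("instruction_bit", 1),
    ("auto", 1), ("opcode", 4), ("source1", 4), ("source2", 4),
    ("destination", 4), ("immediate", 8), ("error_tail", 1),
]


def encode(values):
    """Concatenate the fields, first-received field first, as a string of 0/1."""
    return "".join(format(values[name], f"0{width}b") for name, width in INSTRUCTION_FIELDS)


def decode(bits):
    """Split a received bit string back into named integer fields."""
    out, pos = {}, 0
    for name, width in INSTRUCTION_FIELDS:
        out[name] = int(bits[pos:pos + width], 2)
        pos += width
    return out


fields = {"broadcast": 0, "tag": 5, "error_head": 0, "instruction_bit": 1,
          "auto": 1, "opcode": 2, "source1": 1, "source2": 2,
          "destination": 7, "immediate": 0, "error_tail": 0}
bits = encode(fields)
print(bits, decode(bits) == fields)
```

The data format of claim 29 would follow the same pattern with the instruction bit inactive and a spare bit and data field in place of the instruction fields.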

31. The apparatus of claim 16 wherein there is means provided for sequentially performing matrix 
multiplications of two N by N matrices, one termed a W matrix and the other termed the Y matrix, 
where the multiplication creates a third N by N matrix, termed the z matrix, and registers are used for 
storage units, enabling a process when MPY indicates a multiply instruction, a destination of ALU TREE 
sends results to the attached communicating ALU tree, the root tree processors include the Host 
interfacing function and where the process steps include: 
a) load W matrix (assuming N W values per root tree processor) 
b) load first Y row by communicating Y values through the communicating ALU trees 
c) MPYA R1*R2 → ALU TREE (where the ALU tree has been initialized for the summation process) 
d) calculate first row of result z matrix - multiply Y & W registers followed by summation tree 
e) store the N z values in the root tree processors 
f) communicate second Y row through the communicating ALU trees 
g) when the new Y values have been received, calculate second row of the result z matrix - multiply Y & W registers followed by summation tree 
h) store the N z values in the root tree processors 
i) continue with row calculations until 
j) communicate Nth Y row 
k) when the new Y values have been received, calculate Nth row of result z matrix - multiply Y & W registers followed by summation tree 
l) store final row of result z matrix in the root tree processors. 

32. The apparatus of claim 16 wherein there is means provided for sequentially performing matrix addition 
of two N by N matrices, one termed a W matrix and the other termed the Y matrix, where the addition 
creates a third N by N matrix, termed the z matrix, stored internally to the PEs in the temporary storage 
units, then assuming both Y and W matrices are initialized or in place due to previous calculations and 
there are N² unique Y and W storage units in the structure, the system is enabled to perform the local 
addition on the Y and W storage units, which addition is done within the PEs with the result sent to the 
PEs' temporary storage units, so that after completion of the addition the original Y and W matrices will 
remain intact in the structure and the temporary storage units will contain the addition result matrix that 
can be read out or used for further operations. 

33. The apparatus of claim 16 wherein there is means provided for sequentially performing matrix Boolean 
operations on two N by N matrices, one termed a W matrix and the other termed the Y matrix, where 
the Boolean operation creates a third N by N matrix, termed the z matrix, stored internally to the PEs in 
the temporary storage units, then assuming both Y and W matrices are initialized or in place due to 
previous calculations and there are N² unique Y and W storage units in the structure, the system is 
enabled to perform the local Boolean operation on the Y and W storage units, which Boolean operation 
is done within the PEs with the result sent to the PEs' temporary storage units, so that after completion of 
the Boolean operation the original Y and W matrices will remain intact in the structure and the 
temporary storage units will contain the Boolean operation result matrix which can be read out or used 
for further operations. 



FIG. 7 (FIG. 7A and FIG. 7B). N.U. = NOT USED. 

ARITHMETIC (ADD, MPY, DIV, SQRT, etc.) 
  AUTO: 0 = NO, 1 = AUTO 
  SOURCE-1: R1, R2, R3, R4, R5, R6, IMD1, IMD2 
  SOURCE-2: R1, R2, R3, R4, R5, R6, IMD1, IMD2 
  DESTINATION: R1, R2, R3, R4, R5, R6, ALU TREE 
  COMMENTS: IMD1/2 = CMD REG 1/2 IMMEDIATE DATA 

LOGICAL (AND, OR, EXOR, etc.) 
  AUTO: 0 = NO, 1 = AUTO 
  SOURCE-1: R1, R2, R3, R4, R5, R6, IMD1, IMD2, CEB VECTOR 
  SOURCE-2: R1, R2, R3, R4, R5, R6, IMD1, IMD2, CMP FLAGS 
  DESTINATION: R1, R2, R3, R4, R5, R6, ALU TREE, CEB VECTOR 
  COMMENTS: CEB VECTOR = (CEB1, CEB2, ..., CEB6) WHERE CEBx = CEB BIT FOR REGISTER Rx 

INV 
  AUTO: 0 = NO, 1 = AUTO 
  SOURCE-1: R1, R2, R3, R4, R5, R6, IMD1, IMD2, CEB VECTOR 
  SOURCE-2: N.U. 
  DESTINATION: R1, R2, R3, R4, R5, R6, ALU TREE, CEB VECTOR 

CMPR 
  AUTO: 0 = NO, 1 = AUTO 
  SOURCE-1: R1, R2, R3, R4, R5, R6, IMD1, IMD2 
  SOURCE-2: R1, R2, R3, R4, R5, R6, IMD1, IMD2 
  DESTINATION: LT, GT, EQ FLAGS 
  COMMENTS: LT = SOURCE-1 < SOURCE-2; GT = SOURCE-1 > SOURCE-2; EQ = SOURCE-1 = SOURCE-2 

SHIFT 
  AUTO: 0 = NO, 1 = AUTO 
  SOURCE-1: R1, R2, R3, R4, R5, R6, IMD1, IMD2 
  SOURCE-2: N.U. 
  DESTINATION: N.U. 
  IMMED. DATA: SHIFT TYPE & SHIFT AMOUNT 
  COMMENTS: THE FIRST PART OF THE IMMEDIATE DATA SPECIFIES TYPE OF SHIFT OPERATIONS, e.g. 
  WITH OR WITHOUT WRAPAROUND. THE SECOND PART SPECIFIES THE NUMBER OF BIT SHIFTS. 

SENDREG 
  AUTO: 0 = NO, 1 = AUTO 
  SOURCE-1: R1, R2, R3, R4, R5, R6, IMD1, CEB VECTOR, CMP FLAGS 
  SOURCE-2: N.U. 
  DESTINATION: ALU TREE 
  COMMENTS: IF SOURCE-1 = CEB VECTOR THE SIX CEB BITS ARE PACKED INTO THE MSB BITS OF 
  THE RESPONSE. IF SOURCE-1 = CMP FLAGS THE THREE FLAG BITS ARE PACKED INTO THE MSB BITS OF 
  THE RESPONSE. 

IMMED. DATA column entries, in the order extracted: N.U., DATA, DATA, DATA, DATA, SHIFT TYPE & SHIFT AMOUNT, N.U. 

COMMENTS for the first two table rows (instruction names not recoverable): 
  IF DESTINATION IS CR1 SET THE CMD PATH BIT TO A 0. IF IT IS CR2 SET THE CMD PATH BIT TO A 1. 
  (CEB FIELD NOT USED) ELSE SET THE DATA PATH REGISTER TO THE DESTINATION ADDRESS AND THE 
  DESTINATION REGISTER'S CEB AS SPECIFIED. 
  NO OPERATION 

*AUTO* = 1 → AUTOMATIC REPEAT OF FUNCTION AFTER RECEIPT OF UPDATED DATA FROM SOURCE EXTERNAL 
TO PROCESSOR ELEMENT 
EP 93 10 6730 

DOCUMENTS CONSIDERED TO BE RELEVANT 

Citation of document with indication, where appropriate 

WO-A-91 18351 (IBM CORP), 28 November 1991 
* claims 1,3,22,25 * 

US-A-4 514 807 (T. NOGI), 30 April 1985 
* claims 1,2; figure 1 * 

GB-A-2 219 106 (THE SECRETARY OF STATE FOR DEFENCE), 29 November 1989 
* the whole document * 

IBM TECHNICAL DISCLOSURE BULLETIN, vol. 34, no. 10A, March 1992, NEW YORK, US, 
pages 100 - 106, 'Many SNAP' 
* the whole document * 

IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN: VLSI IN COMPUTERS, 6 October 1986, 
PORT CHESTER, USA, pages 269 - 274, 
D. KOPPELMAN ET AL, 'The implementation of a triangular permutation network using wafer 
scale integration' 
* the whole document * 